
Document node allocatable enhancements #4532

Merged
merged 1 commit into from Jun 19, 2017

Conversation

derekwaynecarr
Member

this is a major feature in OCP 3.6 for node reliability.

any improvements to readability and/or clarity are appreciated.

/cc @mburke5678 @sjenning @eparis

@derekwaynecarr
Member Author

/cc @mffiedler @jeremyeder @sdodson


@sjenning sjenning left a comment


mostly nits. nothing too important. just trying to make a confusing thing less confusing with uniform and precise wording.

@@ -73,9 +73,18 @@ introduction of allocatable resources.
An allocated amount of a resource is computed based on the following formula:

----
[Allocatable] = [Node Capacity] - [kube-reserved] - [system-reserved]
[Allocatable] = [Node Capacity] - [kube-reserved] - [system-reserved] - Hard-Eviction-Thresholds]

nit: missing [

----
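
Worked through with the reservation values used in the example configuration later in this thread, and assuming (for illustration only) a node with 32Gi of memory capacity, the formula gives:

    [Allocatable] = 32Gi - 2Gi (kube-reserved) - 1Gi (system-reserved) - 100Mi (hard-eviction threshold)
                  = 28.9Gi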

[NOTE]
====
The reduction of `Hard-Eviction-Thresholds` from allocatable is a change in behavior to improve

s/reduction/withholding/

[[node-enforcement]]
== Node enforcement

The node is able to enforce the total amout of compute resources that pods

typo: amount


s/enforce/limit/


The node is able to enforce the total amout of compute resources that pods
may consume based on the configured allocatable value. This feature significantly
improves the reliability of the node by preventing end-user pods from starving

"end-user" as opposed to.. what? if pods are pods, i'd delete "end-user".

Member Author

in the future, we may let special users add pods to the "system" and "kube-reserved" cgroups. in that context, "end-user" meant not those, but in this context, i agree it's confusing.

may consume based on the configured allocatable value. This feature significantly
improves the reliability of the node by preventing end-user pods from starving
system services (i.e. container runtime, node agent, etc.) of required compute
resources. It is strongly encouraged that administrators set aside system

s/of required compute resources/for resources/

s/set aside/reserve/

Contributor

Avoid i.e. and etc.
"(for example: container runtime, node agent, and so on)"

been reclaimed. To avoid (or reduce the probability of) system OOMs the node
provides xref:../admin_guide/out_of_resource_handling.adoc[Out Of Resource Handling].

By reserving some memory via the `*--eviction-hard*` flag, the node attempts to evict

There is a mixing of "setting", "flag", and "argument" terminology. We might want to unify this.

Contributor

s/*--eviction-hard*/--eviction-hard
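
For reference, this flag is set through `kubeletArguments` in the node configuration, as in the example configuration further down in this thread (the 100Mi value here is purely illustrative):

kubeletArguments:
  eviction-hard:
    - "memory.available<100Mi"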

and system components use up all their reservation, the memory available for pods is `28.9Gi`,
and kubelet will evict pods when it exceeds this usage.

If we enforce node allocatable (`28.9Gi`) via top level cgroups, then pods can never exceeds `28.9Gi`

typo: exceed

If we enforce node allocatable (`28.9Gi`) via top level cgroups, then pods can never exceeds `28.9Gi`
in which case evictions would not be performed unless kernel memory consumption is above `100Mi`.

In order to support evictions and avoid memcg OOM kills for pods, the node sets the top level cgroup

s/memcg OOM kills for pods/pods being OOM killed inside the kubepods cgroup/

explaining the kubepods cgroup could be worth it so we don't have to use different, confusing language each time we are referring to it. "pod-level cgroup" already means something else so...

Member Author

i want to avoid describing the cgroup hierarchy too much here to be honest.

is the same for user pods.

With the above example, node will evict pods whenever pods consume more than `28.9Gi` which will be
`<100Mi` from `29Gi` which will be the memory limits on the Node Allocatable cgroup.

"Node Allocatable cgroup" == kubepods? I find this sentence confusing.

and kubelet will evict pods when it exceeds this usage.

If we enforce node allocatable (`28.9Gi`) via top level cgroups, then pods can never exceeds `28.9Gi`
in which case evictions would not be performed unless kernel memory consumption is above `100Mi`.

So... I think this is made with the assumption that we are using all the node-allocatable enforcement groups: pods, system-reserved, and kube-reserved.

Member Author

clarified.

- "pods" <3>
----
<1> Enable or disable the new cgroup hierarchy managed by the node. Any change
of this setting requires a full drain of the node. It is strongly encouraged
Contributor

s/It is strongly encouraged that users not change this value./We recommend that users do not change this value./

enforce node allocatable.
<2> The cgroup driver used by the node when managing cgroup hierarchies. This
value must match the driver associated with the container runtime. Valid values
are "systemd" and "cgroupfs". The default is "systemd".
Contributor

s/"systemd"/systemd
s/"cgroupfs"/cgroupfs

corresponding --kube-reserved-cgroup or --system-reserved-cgroup needs to be provided.
In future releases, the node and container runtime will be packaged in a common cgroup
separate from `system.slice`. Until that time, it is not encouraged for users to
change the default value of enforce-node-allocatable flag.
Contributor

s/it is not encouraged for users to change/we recommend not changing
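
Taken together, the settings covered in these callouts correspond to the following node configuration stanza (matching the example configuration later in this thread):

kubeletArguments:
  cgroups-per-qos:
    - "true"
  cgroup-driver:
    - "systemd"
  enforce-node-allocatable:
    - "pods"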

separate from `system.slice`. Until that time, it is not encouraged for users to
change the default value of enforce-node-allocatable flag.

System daemons are expected to be treated similar to Guaranteed pods. System daemons
Contributor

Who is expecting them to be treated similar?
"{product-title} expects system daemons to be treated similar..." ??
"System daemons should be treated similar..." ??

exhaustively to come up with precise estimates and are confident in their ability to
recover if any process in that group is oom_killed.

As a result, it is strongly recommended that users only enforce node allocatable for
Contributor

s/it is strongly recommended/we strongly recommend


By reserving some memory via the `*--eviction-hard*` flag, the node attempts to evict
pods whenever memory availability on the node drops below the reserved value.
Hypothetically, if system daemons did not exist on a node, pods cannot use more than
Contributor

"Hypothetically"?? Do we not know for certain?

s/did not/do not

In order to support evictions and avoid memcg OOM kills for pods, the node sets the top level cgroup
limits for all pods to be `Node Allocatable + Eviction Hard Thresholds`.

However, the scheduler is not expected to scheduler more than `28.9Gi` and so `Node Allocatable` on node
Contributor

s/to scheduler/to schedule

@mburke5678
Contributor

@derekwaynecarr Some style comments.

@derekwaynecarr
Copy link
Member Author

@sjenning @mburke5678 -- made some updates, ptal

Copy link

@sjenning sjenning left a comment


just a few more. then lgtm.

[[node-enforcement]]
== Node enforcement

The node is able to limit the total amount of compute resources that pods

nit: remove "compute"


Administrators should treat system daemons similar to Guaranteed pods. System daemons
can burst within their bounding control groups and this behavior needs to be managed
as part of cluster deployments. Enforcing system-reserved reservations

s/reservations/limits/

provides xref:../admin_guide/out_of_resource_handling.adoc[Out Of Resource Handling].

By reserving some memory via the `--eviction-hard` flag, the node attempts to evict
pods whenever memory availability on the node drops below the reserved value.

s/reserved value/certain value or percentage/

By reserving some memory via the `--eviction-hard` flag, the node attempts to evict
pods whenever memory availability on the node drops below the reserved value.
If system daemons did not exist on a node, pods are limited to the memory
`capacity - eviction-hard`. For this reason, resources reserved for evictions are not

s/reserved for evictions/set aside as a buffer for eviction before reaching out of memory conditions/
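
As a rough illustration of `capacity - eviction-hard`, assuming (hypothetically) 32Gi of node capacity and a 100Mi hard-eviction threshold, pods on a node with no system daemons would be limited to:

    32Gi - 100Mi ≈ 31.9Gi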

@vikram-redhat
Contributor


As a result, we strongly recommended that users only enforce node allocatable for
`pods` by default, and set aside appropriate reservations for system daemons to maintain
overall node reliability.
@qwang1 qwang1 Jun 13, 2017

Does this mean it's recommended like this?

kubeletArguments:
  cgroups-per-qos:
    - "true" 
  cgroup-driver:
    - "systemd"
  enforce-node-allocatable:
    - "pods" 
  kube-reserved:
    - "cpu=200m,memory=30G"
  system-reserved:
    - "cpu=200m,memory=30G"
  eviction-hard:
    - "memory.available<1Gi"

Member Author

@qwang1 - for the described scenario, this would be the example configuration.

we explicitly avoid defining a recommendation for the reservation at this time as it's a factor of pod density.

kubeletArguments:
  cgroups-per-qos:
    - "true" 
  cgroup-driver:
    - "systemd"
  enforce-node-allocatable:
    - "pods" 
  kube-reserved:
    - "memory=2GI"
  system-reserved:
    - "cpu=1,memory=1Gi"
  eviction-hard:
    - "memory.available<100Mi"

@derekwaynecarr
Member Author

@sjenning - updated.

@mburke5678
Contributor

mburke5678 commented Jun 19, 2017

@derekwaynecarr Is this ready for merge?
Is the associated Trello card, "Need guarantees on OS cgroup reservation," related to "Enforce QoS level memory limits," as they both discuss cgroups?

@derekwaynecarr
Member Author

this is ready to merge. this doc should cover https://trello.com/c/QPRTj0uW/524-document-node-need-guarantees-on-os-cgroup-reservation-ops-rfe. @sjenning should write a follow-up to document the alpha support for https://trello.com/c/wq1qcjDa/455-document-node-enforce-qos-level-memory-limits, which is a new flag, --experimental-qos-reserved.

@mburke5678 mburke5678 merged commit fc53aa1 into openshift:master Jun 19, 2017
@mburke5678 mburke5678 added this to the Future Release milestone Jun 19, 2017
@vikram-redhat vikram-redhat modified the milestones: Future Release, Staging Jul 7, 2017
@vikram-redhat vikram-redhat removed this from the Future Release milestone Aug 9, 2017
@vikram-redhat vikram-redhat modified the milestones: Staging, Future Release, OCP 3.6 GA Aug 9, 2017